ABSTRACT

WaveCluster is an important family of grid-based clustering algorithms
that are capable of finding clusters of arbitrary shapes. In this paper,
we investigate techniques to perform WaveCluster while ensuring
differential privacy. Our goal is to develop a general technique for
achieving differential privacy on WaveCluster that accommodates
different wavelet transforms. We show that straightforward techniques
based on synthetic data generation and on introducing random noise when
quantizing the data, though generally preserving the distribution of
data, often introduce too much noise to preserve useful clusters. We
then propose two optimized techniques, PrivTHR and PrivTHR_EM, which can
significantly reduce data distortion during two key steps of
WaveCluster: the quantization step and the significant grid
identification step. We conduct extensive experiments based on four
datasets that are particularly interesting in the context of clustering,
and show that PrivTHR and PrivTHR_EM achieve high utility when privacy
budgets are properly allocated.

1. INTRODUCTION 

Clustering is an important class of data analysis that has been
extensively applied in a variety of fields, such as identifying
different groups of customers in marketing and grouping homologous gene
sequences in biology research. Clustering results allow data analysts to
gain valuable insights into data distribution when it is challenging to
make hypotheses on raw data. Among various clustering techniques, a
grid-based clustering algorithm called WaveCluster is well known for
detecting clusters of arbitrary shapes. WaveCluster relies on wavelet
transforms, a family of convolutions with appropriate kernel functions,
to convert data into a transformed space, where the natural clusters in
the data become more distinguishable.

In many data-analysis scenarios, when the data being analyzed contains
personal information and the result of the analysis needs to be shared
with the public or untrusted third parties, sensitive private
information may be leaked, e.g., whether certain personal information is
stored in a database or has contributed to the analysis. Consider the
databases A and B in Figure 1. These two databases have two attributes,
Monthly Income and Monthly Living Expenses, and differ in only one
record, u. Without u's participation in database A, WaveCluster
identifies two separate clusters, marked in blue and red, respectively.
With u's participation, WaveCluster identifies only one cluster, marked
in blue, from database B. Therefore, merely from the number of clusters
returned (rather than which data points belong to which cluster), an
adversary may infer a user's participation. Due to such potential leaks
of private information, data holders may be reluctant to share the
original data or data-analysis results with each other or with the
public.



Figure 1: Example of personal privacy breach in cluster analysis. 

In this paper, we develop techniques to perform WaveCluster with
differential privacy. Differential privacy provides a provably strong
privacy guarantee that the output of a computation is insensitive to any
particular individual. In other words, based on the output, an adversary
has limited ability to infer whether an individual is present or absent
in the dataset. Differential privacy is often achieved through
randomized algorithms that perturb their output, and the privacy level
is controlled by a parameter ε called the "privacy budget". Intuitively,
the privacy protection provided by differential privacy grows stronger
as ε grows smaller.

WaveCluster provides a framework that allows any kind of wavelet
transform to be plugged in for data transformation, such as the Haar
transform and the Biorthogonal transform [28]. There are various wavelet
transforms that are suitable for different types of applications, such
as image compression and signal processing. With different wavelet
transforms plugged in, WaveCluster can leverage different properties of
the data, such as frequency and location, for finding the dense regions
as clusters. Thus, in this paper, we aim to develop a general technique
for achieving differential privacy on WaveCluster that accommodates
different wavelet transforms.

We first consider a general technique, Baseline, that adapts existing
differentially private data-publishing techniques to WaveCluster through
synthetic data generation. Specifically, we could generate synthetic
data based on any data model of the original data that is published
through differential privacy, and then apply WaveCluster using any
wavelet transform over the synthetic data. Baseline seems particularly
promising as many effective differentially private data-publishing
techniques have been proposed in the literature, all of which strive to
preserve some important properties of the original data. Therefore,
hopefully the "shape" of the original data is also preserved in the
synthetic data, and consequently could be discovered by WaveCluster.
Unfortunately, as we will show later in the paper, this synthetic
data-generation technique often cannot produce accurate results.
Differentially private data-publishing techniques such as spatial
decompositions, adaptive-grid, and Privelet output noisy descriptions of
the data distribution and often contain negative counts for sparse
partitions due to random noise. These negative counts do not affect the
accuracy of large range queries (which is often one of the main utility
measures in private data publishing) since the zero-mean noise
distribution smooths out the effect of negative counts. However,
negative counts cannot be smoothed away in the synthesized dataset,
where they are typically reset to zero counts. Figure 2 shows an example
of inaccurate clustering results produced by Baseline using
adaptive-grid. As we can see, the synthetic data generated in Baseline
significantly distorts the data distribution, causing two clusters to be
merged into one and reducing the accuracy of the WaveCluster results.

[Figure 2 panels: (a) Original, (b) Baseline]

Figure 2: Inaccurate clustering result produced by Baseline. (a) shows
the WaveCluster results on the original data and (b) shows the
WaveCluster results of Baseline, which leverages the adaptive-grid
approach to generate the synthetic data. Points in different clusters
are shown in different colors, and the points marked in red are
considered noise that does not form a cluster.

Motivated by the above challenge, we propose three techniques that
enforce differential privacy on the key steps of WaveCluster, rather
than relying on synthetic data generation. WaveCluster accepts as input
a set of data points in a multi-dimensional space, and consists of the
following main steps. First, in the quantization step, WaveCluster
quantizes the multi-dimensional space by dividing the space into grids,
and computes the count of the data points in each grid. These counts of
grids form a count matrix M. Second, in the wavelet transform step,
WaveCluster applies a wavelet transform on the count matrix M to obtain
the approximation of the multi-dimensional space. Third, in the
significant grid identification step, WaveCluster identifies significant
grids based on the pre-defined density threshold. Fourth, in the cluster
identification step, WaveCluster outputs as clusters the connected
components from these significant grids. To enforce differential privacy
on WaveCluster, we first propose a technique, PrivQT, that introduces
Laplacian noise to the quantization step. However, such straightforward
privacy enforcement cannot produce usable private WaveCluster results,
since the noise introduced in this step significantly distorts the
density threshold for identifying significant grids. To address this
issue, we further propose two techniques, PrivTHR and PrivTHR_EM, which
enforce differential privacy on both the quantization step and the
significant grid identification step. These two techniques differ in how
they determine the noisy density threshold. We show that by allocating
appropriate budgets in these two steps, both techniques can achieve
differential privacy with significantly improved utility.

Traditionally, the effectiveness of WaveCluster is evaluated through
visual inspection by human experts (i.e., visually determining whether
the discovered clusters match those reflected in the user's mind).
Unfortunately, visual inspection is inappropriate for assessing the
utility of differentially private WaveCluster. Visual inspection is not
quantitative, and thus it is hard to systematically compare the impact
of different techniques through visual inspection. Generally,
researchers use quantitative measures to assess the utility of
differentially private results, such as relative or absolute errors for
range queries and prediction accuracy for classification. But there are
no existing utility measures for density-based clustering algorithms
with differential privacy.

To mitigate this problem, in this paper we propose two types of utility
measures. The first measures the dissimilarity between true and private
WaveCluster results by measuring the differences of significant grids
and clusters, which correspond to the outputs of the two key steps (the
significant grid identification and the cluster identification) in
WaveCluster. To more intuitively understand the usefulness of discovered
clusters, our second utility measure considers one concrete application
of cluster analysis, i.e., to build a classifier based on discovered
clusters, and then use that classifier to predict future data. The
prediction accuracy of the classifier therefore reflects, from one
aspect, the actual utility of private WaveCluster.

To evaluate the proposed techniques, our experiments use four datasets
containing different data shapes that are particularly interesting in
the context of clustering. Our results show that PrivTHR and PrivTHR_EM
achieve high utility for both types of utility measures, and are
superior to Baseline and PrivQT.

2. RELATED WORK 

The syntactic approaches for privacy-preserving clustering output
k-anonymous clusters. Friedman et al. presented an algorithm to output
k-anonymous clusters by using a minimum spanning tree. Karakasidis et
al. created k-anonymous clusters by merging clusters so that each
cluster contains at least k key values of the records. Fung et al.
proposed an approach that converts the anonymity problem for cluster
analysis to the counterpart problem for classification analysis.
Aggarwal et al. proposed a perturbation method called r-gather
clustering, which releases the cluster centers, together with their
sizes, radii, and a set of associated sensitive values. However, these
approaches only satisfy syntactic privacy notions such as k-anonymity,
and cannot provide the formal privacy guarantees that differential
privacy offers.

In this work, our goal is to perform WaveCluster under differential
privacy. Initial work on differential privacy [15, 25] focused on
theoretical proofs of its feasibility on various data analysis tasks,
e.g., histograms and logistic regression.

More recent work has focused on practical applications of differential
privacy for privacy-preserving data publishing. An approach proposed by
Barak et al. encoded marginals with Fourier coefficients and then added
noise to the released coefficients. Hay et al. exploited consistency
constraints to reduce noise for histogram counts. Xiao et al. proposed
Privelet, which uses wavelet transforms to reduce noise for histogram
counts. Cormode et al. [10] indexed data by kd-trees and quad-trees,
developing effective budget allocation strategies for building the noisy
trees and obtaining noisy counts for the tree nodes. Qardaji et al.
proposed uniform-grid and adaptive-grid methods to derive appropriate
partition granularity in differentially private synopsis publishing. Xu
et al. [40] proposed the NoiseFirst and StructureFirst techniques for
constructing optimal noisy histograms, using dynamic programming and the
Exponential mechanism. These data publishing techniques are specifically
crafted for answering range queries. Unfortunately, synthesizing the
dataset and applying WaveCluster on top of it often renders WaveCluster
results useless, since these differentially private data publishing
techniques do not capture the essence of WaveCluster and introduce too
much unnecessary noise for WaveCluster.

Another important line of prior work focuses on integrating differential
privacy into other practical data analysis tasks, such as regression
analysis, model fitting, and classification. Chaudhuri et al. [8]
proposed a differentially private regularized logistic regression
algorithm that balances privacy with learnability. Zhang et al. proposed
a differentially private approach for logistic and linear regressions
that perturbs the objective function of the regression model, rather
than simply introducing noise into the results. Friedman et al.
incorporated differential privacy into several types of decision trees
and subsequently demonstrated the tradeoff among privacy, accuracy, and
sample size. Using decision trees as an example application, Mohammed et
al. investigated a generalization-based algorithm for achieving
differential privacy for classification problems.

Differentially private cluster analysis has also been studied in prior
work. Zhang et al. proposed differentially private model fitting based
on genetic algorithms, with applications to k-means clustering. McSherry
introduced the PINQ framework, which has been applied to achieve
differential privacy for k-means clustering using an iterative
algorithm. Nissim et al. proposed the sample-aggregate framework that
calibrates the noise magnitude according to the smooth sensitivity of a
function. They showed that their framework can be applied to k-means
clustering under the assumption that the dataset is well-separated.
These research efforts primarily focus on centroid-based clustering,
such as k-means, which is most suited for separating convex clusters and
provides insufficient spatial information to detect clusters with
complex shapes, e.g., concave shapes. In contrast to these research
efforts, we propose techniques that enforce differential privacy on
WaveCluster, which is not restricted to well-separated datasets and can
detect clusters with arbitrary shapes.

3. PRELIMINARIES 

In this section, we first present the background of differential 
privacy. Then we describe the WaveCluster algorithm followed by 
our problem statement. 

3.1 Differential Privacy 

Differential privacy [13] is a recent privacy model, which guarantees
that an adversary cannot infer an individual's presence in a dataset
from the randomized output, despite having knowledge of all remaining
individuals in the dataset.

Definition 1. (ε-differential privacy): Given any pair of neighboring
databases D and D' that differ only in one individual record, a
randomized algorithm A is ε-differentially private iff for any
S ⊆ Range(A):

$$\Pr[A(D) \in S] \le e^{\epsilon} \cdot \Pr[A(D') \in S]$$

The parameter ε indicates the level of privacy. Smaller ε provides
stronger privacy. When ε is very small, $e^{\epsilon} \approx 1 + \epsilon$.
Since the value of ε directly affects the level of privacy, we refer to
it as the privacy budget. Appropriate allocation of the privacy budget
for a computational process is important for reaching a favorable
trade-off between privacy and utility. The most common strategy to
achieve ε-differential privacy is to add noise to the output of a
function. The magnitude of the introduced noise is calibrated by the
privacy budget ε and the sensitivity of the query function. The
sensitivity of a query function f is defined as the maximum difference
between the outputs of the query function on any pair of neighboring
databases:

$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1$$


Figure 3: Illustration of WaveCluster.


There are two common approaches for achieving ε-differential privacy:
the Laplace mechanism [14] and the Exponential mechanism.

Laplace Mechanism: The output of a query function f is perturbed by
adding noise from the Laplace distribution with probability density
function $f(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$,
where $b = \frac{\Delta f}{\epsilon}$. The following randomized
mechanism $A_l$ satisfies ε-differential privacy:

$$A_l(D) = f(D) + \mathrm{Lap}\left(\frac{\Delta f}{\epsilon}\right)$$
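
As a minimal sketch of the Laplace mechanism described above (not code from the paper), the following Python uses only the standard library. The function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the sampler relies on the fact that the difference of two independent exponential variables with mean b follows a Laplace distribution with scale b.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value perturbed with Lap(sensitivity / epsilon) noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical usage: release a count query (sensitivity 1) with epsilon = 0.5.
rng = random.Random(0)
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller ε yields a larger noise scale Δf/ε, which matches the statement that smaller ε gives stronger privacy at the cost of accuracy.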

Exponential Mechanism: This mechanism returns an output that is close to
the optimum, with respect to a quality function. A quality function
q(D, r) assigns a score to all possible outputs r ∈ R, where R is the
output range of f, and better outputs receive higher scores. A
randomized mechanism $A_e$ that outputs r ∈ R with probability

$$\Pr[A_e(D) = r] \propto \exp\left(\frac{\epsilon \, q(D, r)}{2 S(q)}\right)$$

satisfies ε-differential privacy, where S(q) is the sensitivity of the
quality function.
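
The sampling rule of the Exponential mechanism can be sketched as follows. This is an illustrative sketch over a finite candidate list, not the authors' implementation; `exponential_mechanism` and its parameters are hypothetical names, and the quality function is passed in as a callable that already has the dataset bound.

```python
import math
import random

def exponential_mechanism(candidates, quality, sensitivity, epsilon, rng=random):
    """Pick one candidate r with probability proportional to
    exp(epsilon * quality(r) / (2 * sensitivity))."""
    scores = [quality(r) for r in candidates]
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical usage: prefer outputs close to 3 among four candidates.
rng = random.Random(0)
choice = exponential_mechanism([1, 2, 3, 4], lambda r: -abs(r - 3), 1.0, 1.0, rng)
```

Subtracting the maximum score does not change the sampling distribution, because the weights are only defined up to a common factor.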

Differential privacy has two properties: sequential composition and
parallel composition. Sequential composition states that, given n
independent randomized mechanisms $A_1, A_2, \ldots, A_n$ where $A_i$
(1 ≤ i ≤ n) satisfies $\epsilon_i$-differential privacy, a sequence of
$A_i$ over the dataset D satisfies ε-differential privacy, where
$\epsilon = \sum_{i=1}^{n} \epsilon_i$.

Parallel composition states that, given n independent randomized
mechanisms $A_1, A_2, \ldots, A_n$ where $A_i$ (1 ≤ i ≤ n) satisfies
ε-differential privacy, a sequence of $A_i$ over a set of disjoint data
sets $D_i$ satisfies ε-differential privacy.
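
As a worked illustration of the two composition properties, with budget values chosen purely for the example (they are assumptions, not settings from the paper):

```latex
% Sequential composition: two mechanisms run on the same dataset D.
\epsilon_1 = 0.3,\quad \epsilon_2 = 0.7
\;\Longrightarrow\;
\epsilon = \epsilon_1 + \epsilon_2 = 1.0

% Parallel composition: each A_i runs on its own disjoint part D_i
% with budget 0.5, so the overall guarantee stays at the same level.
D = D_1 \cup D_2 \cup D_3,\quad D_i \cap D_j = \emptyset \ (i \neq j)
\;\Longrightarrow\;
\epsilon = 0.5
```

The first case is why a multi-step algorithm has to divide its total budget among its steps, while the second is why noise can be added to many disjoint partitions without the budget adding up.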

3.2 WaveCluster 

WaveCluster is an algorithm developed by Sheikholeslami et al. for the
purpose of clustering spatial data. It works by using a wavelet
transform to detect the boundaries between clusters. A wavelet transform
allows the algorithm to distinguish between areas of high contrast (high
frequency components) and areas of low contrast (low frequency
components). The motivation behind this distinction is that within a
cluster there should be low contrast and between clusters there should
be an area of high contrast (the border). WaveCluster has the following
steps, as shown in Figure 3:

Quantization: Quantize the feature space into grids of a specified size,
creating a count matrix M.

Wavelet Transform: Apply a wavelet transform, such as the Haar transform
or the Biorthogonal transform [28], to the count matrix M, and decompose
M into the average subband, which gives the approximation of the count
matrix, and the detail subband, which carries the information about the
boundaries of clusters. We refer to the average subband as the
wavelet-transformed-value matrix (W).

Significant Grid Identification: Identify the significant grids from the
average subband W. WaveCluster constructs a sorted list L of the
positive wavelet-transformed values obtained from W and computes the
p-th percentile of the values in L. The values that are below the p-th
percentile of L are non-significant values. Their corresponding grids
are considered non-significant grids, and the data points in the
non-significant grids are considered noise.

Cluster Identification: Identify clusters from the significant grids
using a connected component labeling algorithm [23] (two grids are
connected if they are adjacent), map the clusters back to the original
multi-dimensional space, and label the data points based on which
cluster the data points reside in.
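
The four steps above can be sketched for the 2-D case as follows. This is a simplified illustration under stated assumptions, not the reference WaveCluster implementation: it assumes a square g x g grid with even g, uses the level-1 Haar average subband computed as the mean of each 2 x 2 block, treats cells as adjacent when they share an edge, keeps ties at the threshold, and omits mapping labels back to the data points. All function names are hypothetical.

```python
from collections import deque

def quantize(points, g, lo, hi):
    """Count points per cell of a g x g grid over [lo, hi) x [lo, hi).

    Points are assumed to lie inside the range.
    """
    M = [[0] * g for _ in range(g)]
    width = (hi - lo) / g
    for x, y in points:
        i = min(int((x - lo) / width), g - 1)
        j = min(int((y - lo) / width), g - 1)
        M[i][j] += 1
    return M

def haar_average(M):
    """Level-1 Haar average subband: mean of each 2 x 2 block (even g)."""
    h = len(M) // 2
    return [[(M[2*i][2*j] + M[2*i][2*j+1] + M[2*i+1][2*j] + M[2*i+1][2*j+1]) / 4.0
             for j in range(h)] for i in range(h)]

def significant_cells(W, p):
    """Keep the cells holding the top (1 - p) fraction of positive values."""
    L = sorted(v for row in W for v in row if v > 0)
    k = int(round((1 - p) * len(L)))
    if k == 0:
        return set()
    d = L[-k]  # k-th largest positive value; ties at d are kept
    return {(i, j) for i, row in enumerate(W) for j, v in enumerate(row) if v >= d}

def label_components(cells):
    """Connected component labeling with edge adjacency (BFS)."""
    labels, next_label = {}, 0
    for start in sorted(cells):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            i, j = queue.popleft()
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in cells and nb not in labels:
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return labels

# Hypothetical data: two dense blobs plus one stray point.
pts = ([(0.5, 0.5), (0.5, 1.5), (1.5, 0.5), (1.5, 1.5)] * 3
       + [(6.5, 6.5), (6.5, 7.5), (7.5, 6.5), (7.5, 7.5)] * 3
       + [(4.5, 0.5)])
W = haar_average(quantize(pts, 8, 0.0, 8.0))
clusters = label_components(significant_cells(W, p=0.3))
```

In this toy input the stray point produces a small positive value in W that falls below the threshold, so only the two dense blobs survive as separate clusters.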

In WaveCluster, users need to specify four parameters: 

num_grid (g_1, g_2, ..., g_n): the number of grids that the
n-dimensional space is partitioned into along each dimension. For
brevity, we simply use g to refer to the partitions of the n-dimensional
space (g_1, g_2, ..., g_n). This parameter controls the scaling of
quantization. Inappropriate scaling can cause problems of
over-quantization and under-quantization, affecting the accuracy of
clustering.

density threshold (p): a percentage value p that specifies that p% of
the values in L are non-significant values. For ease of presentation, we
use k = (1 − p)|L| to represent the number of top values in L whose
corresponding grids are considered significant grids.

level: a wavelet decomposition level, which indicates how many times a
wavelet transform is applied. The larger the level is, the more
approximate the result is. In our techniques, we set level to 1 since a
smaller level value provides more accurate results.

wavelet: a wavelet transform to be applied. The Haar transform is one of
the simplest and most widely used wavelet transforms; it is computed by
iterating difference and averaging between odd and even samples of a
signal (or a sequence of data points). Other commonly used wavelet
transforms include the Biorthogonal transform [28], the Daubechies
transform [11], and so on.
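
To make the averaging-and-differencing description of the Haar transform and the role of the level parameter concrete, here is a small 1-D sketch. It assumes the unnormalized convention (plain averages and half-differences); other texts scale by 1/sqrt(2) instead. `haar_step` and `haar_decompose` are hypothetical helper names.

```python
def haar_step(signal):
    """One level of the unnormalized Haar transform on an even-length sequence.

    Returns (averages, details) computed from each even/odd sample pair.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2.0 for a, b in pairs]
    details = [(a - b) / 2.0 for a, b in pairs]
    return averages, details

def haar_decompose(signal, level):
    """Apply haar_step `level` times to the running average subband."""
    approx, details = list(signal), []
    for _ in range(level):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

approx1, _ = haar_decompose([4, 2, 5, 5], level=1)  # [3.0, 5.0]
approx2, _ = haar_decompose([4, 2, 5, 5], level=2)  # [4.0]
```

Each additional level halves the length of the average subband, which is why a larger level gives a coarser approximation.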

Motivating Scenario. Consider a scenario with two participants: the data
owner (e.g., a hospital) and the querier (e.g., a data miner). The data
owner holds raw data and has the legal obligation to protect
individuals' privacy, while the querier is eager to obtain cluster
analysis results for further exploration. The goal of our work is to
enable the data owner to release cluster analysis results using
WaveCluster while not compromising the privacy of any individual who
contributes to the raw data. The data owner has a good knowledge of the
raw data, and it is not difficult for her to pick the appropriate
parameters (e.g., num_grid, density threshold, and wavelet) for
non-private WaveCluster. For example, the data owner may draw from her
past experience on similar data to determine the appropriate parameters
for the current dataset. The parameters picked for the non-private
setting are directly used for the private setting, and thus the data
owner does not need to infer another set of parameters for the private
setting.

Problem Statement. Given a raw data set D, appropriate WaveCluster
parameters for D, and a privacy budget ε, our goal is to investigate an
effective approach A such that A (1) satisfies ε-differential privacy,
and (2) achieves high utility of the private WaveCluster results with
regard to the utility metrics U.

4. APPROACHES 

In this section, we present four techniques for achieving differential
privacy on WaveCluster. We first describe the Baseline technique that
achieves differential privacy through synthetic data generation. We then
describe three techniques that enforce differential privacy on the key
steps of WaveCluster.

4.1 Baseline Approach (Baseline) 

A straightforward technique to achieve differential privacy on
WaveCluster is as follows: (1) adapt an existing ε-differential privacy
preserving data publishing method to get a noisy description of the data
distribution in some fashion, such as a set of contingency tables or a
spatial decomposition tree [10, 40]; (2) generate a synthetic dataset
according to the noisy description; (3) apply WaveCluster on the
synthetic dataset. We refer to this technique as Baseline, and its
pseudocode is shown in Algorithm 1.


Algorithm 1 Baseline

Input: Dataset D, num_grid g, density threshold p, wavelet transform w,
differential privacy budget ε
Output: A set of differentially private clusters
1: procedure Baseline(D, g, p, w, ε)
2:     D' = DiffPrivPublishing(D, ε)
3:     M' = Quantization(D', g)
4:     W' = WaveletTransform(M', w)
5:     L' = ConvertToPosSortedArray(W')
6:     k' = (1 − p)|L'|
7:     d' = Top(k', L')
8:     return ConnCompLabel(W', d')
9: end procedure


Baseline first leverages an ε-differential privacy preserving data
publishing method to obtain a noisy dataset D' (Line 2) and partitions
D' based on the number of grids g to obtain the noisy count matrix M'
(Line 3). Baseline then applies a wavelet transform on M' to obtain W'
(Line 4). W' is then turned into a list L' that keeps only positive
values, sorted in ascending order (Line 5). With L', k' is computed
based on the specified density threshold p and the size of L' (Line 6).
Finally, Baseline obtains d' as the top k'-th value in L' (Line 7),
where any value in L' greater than d' is considered a significant value,
and applies the connected component labeling algorithm to identify
clusters of significant grids (Line 8).

Discussion. Baseline achieves differential privacy on WaveCluster by
achieving differential privacy on data publishing. However, it does not
produce accurate WaveCluster results in most cases. The adapted
ε-differential privacy preserving data publishing method is designed for
answering range queries. The noisy descriptions of the data distribution
generated by the method may contain negative counts for certain
partitions since the noise distribution is Laplacian with zero mean.
These negative counts do not affect the range query accuracy too much
since the zero-mean noise distribution smooths the effect of noise. For
example, suppose a partition p_1 has a true count of 2 and a noisy count
of -2. Its noise is canceled by another partition p_2 with a true count
of 10 and a noisy count of 14 when both p_1 and p_2 are included in a
range query. In particular, when the range query spans a large range of
a dataset, a single partition with a noisy negative count does not
affect its accuracy too much. However, when the method is used for
generating a synthetic dataset, the noisy negative counts are reset to
zero, causing the data distribution to change radically on the whole and
further leading to severe deviation in differentially private
WaveCluster results.
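
The arithmetic of the example above can be checked with a few lines of Python. The counts are the ones given in the text; `clamp_negative` is a hypothetical helper standing in for the reset-to-zero step that precedes synthetic data generation.

```python
def clamp_negative(counts):
    """Reset negative noisy counts to zero, as done before sampling synthetic points."""
    return [max(0, c) for c in counts]

true_counts = [2, 10]    # partitions p_1 and p_2
noisy_counts = [-2, 14]  # their noisy counts

range_true = sum(true_counts)                        # 12
range_noisy = sum(noisy_counts)                      # 12: the two errors cancel
range_synthetic = sum(clamp_negative(noisy_counts))  # 14: the cancellation is lost
```

Clamping removes only the negative errors, so the positive errors are no longer offset and the synthetic data ends up denser than the original in sparse regions.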

4.2 Private Quantization (PrivQT)

To address the challenge faced by Baseline, we propose techniques that
enforce differential privacy on the key steps of WaveCluster. Our first
approach, called Private Quantization (PrivQT), introduces independent
Laplacian noise in the quantization step to achieve differential
privacy. In the quantization step, the data is divided into grids and
the count matrix M is computed. To ensure differential privacy in this
step, we rely on the Laplace mechanism, which introduces independent
Laplacian noise to M. Clearly, if we change one individual in the input
data, such as adding, removing, or modifying an individual, there is at
most one change in one entry of M. According to the parallel composition
property of differential privacy, the noise introduced to each grid is
$\mathrm{Lap}(\frac{1}{\epsilon})$, given a privacy budget ε. Since the
following steps of WaveCluster are carried out using the differentially
private count matrix M', the clusters derived from these steps are also
differentially private. Algorithm 2 shows the pseudocode of PrivQT.
Apart from the first step, which introduces independent Laplacian noise
to M (Line 2), the other steps (Lines 3-7) are the same as in Baseline.


Algorithm 2 PrivQT

Input: Dataset D, num_grid g, density threshold p, wavelet transform w,
differential privacy budget ε
Output: A set of differentially private clusters
1: procedure PrivQT(D, g, p, w, ε)
2:     M' = PrivQuantization(D, g, ε)
3:     W' = WaveletTransform(M', w)
4:     L' = ConvertToPosSortedArray(W')
5:     k' = (1 − p)|L'|
6:     d' = Top(k', L')
7:     return ConnCompLabel(W', d')
8: end procedure
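
Line 2 of Algorithm 2 could be realized along the following lines. This is a hedged sketch that takes an already computed count matrix rather than the raw dataset, assumes a per-cell sensitivity of 1 as in the text, and uses the hypothetical name `priv_quantization`; it is not the authors' code.

```python
import random

def priv_quantization(M, epsilon, rng=random):
    """Return a copy of count matrix M with independent Lap(1/epsilon) noise per cell."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    def lap():
        # Difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
        return rng.expovariate(epsilon) - rng.expovariate(epsilon)

    return [[count + lap() for count in row] for row in M]

# Hypothetical usage on a tiny 2 x 3 count matrix.
noisy_M = priv_quantization([[0, 5, 12], [3, 0, 0]], epsilon=1.0, rng=random.Random(0))
```

Because the remaining steps only read the noisy matrix, they act as post-processing and need no further budget.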


Selecting the appropriate grid size (reflected by the parameter num_grid
g) in the quantization step strongly affects the accuracy of WaveCluster
results, and also of the differentially private WaveCluster results. A
small grid size (small g) causes more data points to fall into each grid
and thus the count of data points for each grid becomes larger, which
makes the count matrix M resistant to Laplacian noise. However, a small
grid size does not help WaveCluster detect clusters with accurate shapes
and renders the results less useful. On the other hand, although
imposing a larger grid size on the data captures the density
distribution of the data more clearly, it makes each grid's count too
small and thus sensitive to Laplacian noise, which dramatically affects
the identification of significant grids and, in turn, the shapes of
clusters. Our empirical results show that differentially private
WaveCluster results maintain high utility only when an appropriate grid
size is given.

Discussion. Although PrivQT achieves differential privacy on the
WaveCluster results, the noisy count matrix M' and the resulting noisy
L' are significantly distorted, and consequently so are the clustering
results. The reason is as follows. Given a specified percentage value p,
PrivQT computes k' from the positive values in W', where W' is derived
from M', which is perturbed by Laplacian noise. The Laplacian
distribution is symmetric and has zero mean. As a result, approximately
half of the zero-count grids become noisy positive-count grids due to
positive noise while the remaining ones are turned into noisy
negative-count grids due to negative noise. These noisy positive-count
grids may cause their corresponding wavelet-transformed values in W' to
become positive (depending on the targeted wavelet transform), so they
inappropriately participate in the computation of k' and distort it. Due
to the dominating errors introduced by approximately half of the
zero-count grids becoming noisy positive-count grids, our empirical
results show that the utility of private WaveCluster results produced by
PrivQT improves only marginally even for a large privacy budget.
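
A small simulation illustrates how much the noisy positive values can inflate k'. It is a simplified sketch that works directly on counts and skips the wavelet transform; the grid composition (9000 empty cells, 1000 cells with count 50), p = 0.2, and ε = 1 are all hypothetical choices for illustration.

```python
import random

def noisy_positive_count(true_counts, epsilon, rng):
    """Count entries that are positive after adding Lap(1/epsilon) noise."""
    return sum(1 for c in true_counts
               if c + rng.expovariate(epsilon) - rng.expovariate(epsilon) > 0)

rng = random.Random(7)
true_counts = [0] * 9000 + [50] * 1000  # hypothetical: 90% of cells are empty
p = 0.2
k_true = round((1 - p) * 1000)          # 800 significant cells without noise
size_noisy = noisy_positive_count(true_counts, epsilon=1.0, rng=rng)
k_noisy = round((1 - p) * size_noisy)   # roughly (1 - p) * (1000 + 4500)
```

With about half of the empty cells turning positive, the noisy list is several times longer than the true one, and k' grows with it regardless of how accurately the dense cells are preserved.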


4.3 Private Quantization with Refined Noisy 
Density Threshold (privTHR) 

The limitation of PrivQT lies in the severe distortion of k' by 
Laplacian noise introduced into count matrix M'. To mitigate the 
distortion, we propose a technique, PrivTHR, which prunes a por¬ 
tion of noisy positive values in W' to refine the computation of k'. 
Algorithm]^ shows the pseudocode of PrivTHR. 

PrivTHR first introduces random noise to the count matrix M,
similar to PrivQT, and obtains a noisy count matrix M' (Line 2).
PrivTHR then applies a wavelet transform on M' to obtain W'
(Line 3). W' is then turned into a list L' that keeps only the
positive values, sorted in ascending order (Line 4). Thus, only the
positive values in W' are used for computing k' based on the
specified density threshold ρ. To reduce the distortion of k',
starting from the smallest noisy positive values in L', PrivTHR
discards the first |Z|'/2 values (Line 6), where Z represents the
non-positive (negative or zero) values in W and |Z|' is a noisy
estimate of |Z| (Line 5). PrivTHR removes |Z|'/2 values from L'
because the utility analysis (in Section 5.2.1) shows that
approximately |Z|/2 non-positive values in W are turned into
positive values due to the randomness of Laplacian noise. Since
|Z| partially describes the data distribution and releasing |Z|
without protection may leak private information, PrivTHR also
introduces Laplacian noise to |Z|, ensuring that the whole process
correctly enforces differential privacy (Lines 11-17). The noise
introduced to |Z| depends on the wavelet transform used to compute
W. For example, if we use Haar transform for n-dimensional data, a
value in W is computed by averaging two neighboring elements along
each dimension. Since any single change in the input only causes
one entry of the count matrix M to change by 1, the change of M
causes at most one value in W to change, and thus causes |Z| to
change by at most 1, i.e., the sensitivity of |Z| is 1. Finally,
PrivTHR obtains d' as the top k'-th value in L'' (Line 8), where
any value in L'' greater than d' is considered a significant value,
and applies the connected component labeling algorithm to identify
clusters of significant grids (Line 9).
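The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not part of the paper: the data is two-dimensional, the count matrix has even dimensions, the density threshold `rho` is given as a fraction in [0, 1), and the function names (`laplace`, `haar_approx`, `privthr_threshold`) are hypothetical. Connected component labeling is omitted; the sketch stops at the noisy threshold d'.

```python
import random

def laplace(scale):
    """Laplace(0, scale) sample, drawn as the difference of two i.i.d. exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def haar_approx(m):
    """Level-1 2D Haar approximation: each 2x2 block is summed and divided by 2.
    Assumes the matrix has even dimensions."""
    return [[(m[i][j] + m[i][j + 1] + m[i + 1][j] + m[i + 1][j + 1]) / 2.0
             for j in range(0, len(m[0]), 2)]
            for i in range(0, len(m), 2)]

def privthr_threshold(counts, rho, eps, alpha):
    """Return (W', d') from a true count matrix, following Lines 2-8 of PrivTHR."""
    eps1, eps2 = alpha * eps, (1.0 - alpha) * eps
    noisy = [[c + laplace(1.0 / eps1) for c in row] for row in counts]  # Line 2
    w_noisy = haar_approx(noisy)                                        # Line 3
    l_noisy = sorted(v for row in w_noisy for v in row if v > 0)        # Line 4
    # Line 5: noisy count of non-positive values in the true W (sensitivity 1).
    z = sum(1 for row in haar_approx(counts) for v in row if v <= 0)
    z_noisy = z + laplace(1.0 / eps2)
    # Line 6: drop the |Z|'/2 smallest noisy positive values.
    drop = min(len(l_noisy), max(0, round(z_noisy / 2.0)))
    l_pruned = l_noisy[drop:]
    k = int((1.0 - rho) * len(l_pruned))                                # Line 7
    # Line 8: the k'-th largest remaining value; infinity if nothing remains.
    d = l_pruned[len(l_pruned) - k] if k >= 1 else float("inf")
    return w_noisy, d
```

Clamping the number of dropped values to the range [0, |L'|] is a practical guard added here, since the noisy estimate |Z|' may be negative or exceed the list length.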

Budget Allocation. PrivTHR first introduces Laplacian noise
in the quantization step using a privacy budget ε1 = αε, where
0 < α < 1. In the significant grid identification step, PrivTHR
further introduces Laplacian noise to |Z| using the remaining
privacy budget ε2 = (1 − α)ε. Based on the utility analysis in
Section 5.2.2, ε1 requires a smaller amount of budget than ε2. Our
empirical results in Section 7 further show in detail the impact
of α on clustering accuracy.


4.4 Private Quantization with Noisy Threshold using Exponential Mechanism (PrivTHR_EM)

Besides pruning noisy positive values in W', we propose an
alternative technique that employs the Exponential mechanism for
deriving k' from the sorted list L. Algorithm 4 shows the
pseudocode of PrivTHR_EM.

PrivTHR_EM first introduces Laplacian noise to the count matrix
M, which is similar to PrivQT and PrivTHR. After that, we obtain a
noisy count matrix M' (Line 2) and the corresponding W' (Line 3).
Different from the previous two techniques that compute k' from

^For other wavelet transforms that use circular convolutions, such 
as Biorthogonal transform, the sensitivity of n depends on the count 
of positive values and the count of negative values in the matrix 
computed by the coefficient vector p 8 ) . 











Algorithm 3 PrivTHR

Input: Dataset D, num_grid g, density threshold ρ, wavelet transform w,
differential privacy budget ε, allocation percentage α
Output: A set of differentially private clusters
 1: procedure PrivTHR(D, g, ρ, w, ε, α)
 2:     M' = PrivQuantization(D, g, αε)
 3:     W' = WaveletTransform(M', w)
 4:     L' = ConvertToPosSortedArray(W')
 5:     |Z|' = NoisyCountOfNonPosValues(D, g, w, (1 − α)ε)
 6:     L'' = RemoveFrom(L', 0, |Z|'/2)
 7:     k' = (1 − ρ)|L''|
 8:     d' = Top(k', L'')
 9:     return ConnCompLabel(W', d')
10: end procedure
11: procedure NoisyCountOfNonPosValues(D, g, w, ε)
12:     M = Quantization(D, g)
13:     W = WaveletTransform(M, w)
14:     |Z| = CountOfNonPos(W)
15:     |Z|' = |Z| + Lap(1/ε)
16:     return |Z|'
17: end procedure


W', PrivTHR_EM derives k' from W using the Exponential mechanism
(Lines 7-15). In this case, although the sorted list derived from
W' is severely distorted in PrivTHR_EM, the derivation of k' from
W is not affected by the distorted W' at all. Given a reasonable
privacy budget, k' derived from W using the Exponential mechanism
is reasonably accurate, compared to the case when k' is derived
from W'.

The quality function fed into the Exponential mechanism is [10]:

$q(L, x) = -|rank(x) - k|$,

where L represents the sorted positive values in W with Min and
Max values (Line 10), and X represents the possible output space,
i.e., all the possible values x in the range (0, Max]. Given a W
with m positive values ordered as $x_1 > x_2 > \ldots > x_m$, these
m values divide the range (0, Max] into m partitions:
$(0, x_m], (x_m, x_{m-1}], \ldots, (x_2, x_1]$, and the ranks for
these partitions are m, m − 1, ..., 2, 1. For any
$x \in (x_{i+1}, x_i]$ (taking $x_{m+1} = 0$), its rank is
rank(x_i). For example, if $x \in (x_2, x_1]$, rank(x) =
rank(x_1) = 1. Similar to PrivTHR, when using Haar transform, any
single change in the input causes only one value in W to change.
Thus, at most one value will be added into or removed from L,
causing the outcome of q(L, x) to change by at most 1, i.e., the
sensitivity of q(L, x) is 1.

Plugging the above quality function into the Exponential
mechanism, we obtain the following algorithm: for any value
x ∈ (0, Max], the Exponential mechanism (EM) returns x with
probability $\Pr[EM(L) = x] \propto \exp(-\frac{\epsilon}{2}|rank(x) - k|)$
(Line 12). Since all the values in a partition have the same
probability to be chosen, a random value from the partition
$Pt_i = (x_{i+1}, x_i]$ will be chosen with probability
proportional to $|Pt_i| \cdot \exp(-\frac{\epsilon}{2}|i - k|)$. In
other words, once k' is chosen, PrivTHR_EM further computes a
uniform random value d' from the chosen partition $Pt_{k'}$
(Line 13), and any value in L' greater than d' is considered a
significant value.
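The sampling procedure described above can be illustrated with a minimal Python sketch. It assumes a quality-function sensitivity of 1 (the Haar case discussed above); the function name `noisy_density_threshold` and its parameter names are hypothetical and only mirror the roles of Lines 10-13 in Algorithm 4.

```python
import math
import random

def noisy_density_threshold(values, target_rank, eps):
    """Pick a rank via the exponential mechanism, then a uniform threshold in its partition.

    values: wavelet-transformed values (only the positive ones are used)
    target_rank: the true rank k (rank 1 is the largest value)
    eps: privacy budget spent on this step
    """
    xs = sorted((v for v in values if v > 0), reverse=True)  # x_1 > x_2 > ... > x_m
    if not xs:
        raise ValueError("need at least one positive value")
    lowers = xs[1:] + [0.0]  # lower end of each partition; the last one starts at 0
    # Weight of partition i: |Pt_i| * exp(-(eps / 2) * |i - k|).
    weights = [(hi - lo) * math.exp(-(eps / 2.0) * abs(rank - target_rank))
               for rank, (hi, lo) in enumerate(zip(xs, lowers), start=1)]
    idx = random.choices(range(len(xs)), weights=weights, k=1)[0]
    # Uniform value inside the chosen partition.
    return random.uniform(lowers[idx], xs[idx])
```

With a large budget the weight of the true partition dominates, so the returned threshold almost always lies between the k-th and (k+1)-th largest values; with a small budget, wide partitions far from k can be chosen with noticeable probability.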

Budget Allocation. Similar to PrivTHR, the privacy budget is 

^Similar to PrivTHR, for other wavelet transforms that use circu¬ 
lar convolutions, the sensitivity of q{L^X) depends on the count 
of positive values and the count of negative values in the matrix 
computed by the coefficient vector p8) . 


split between two steps: introduction of Laplacian noise in
quantization and obtaining k' using the Exponential mechanism.
Previous empirical experiments on splitting budgets between
obtaining a noisy median and noisy counts suggest that a 30% vs.
70% budget allocation strategy performs best. Specifically, 70% of
the budget is allocated for obtaining the noisy count matrix M'
(Line 2) and the remaining budget is allocated for computing k'
(Line 4).


Algorithm 4 PrivTHR_EM

Input: Dataset D, num_grid g, density threshold ρ, wavelet transform w,
differential privacy budget ε, allocation percentage α
Output: A set of differentially private clusters
 1: procedure PrivTHREM(D, g, ρ, w, ε, α)
 2:     M' = PrivQuantization(D, g, αε)
 3:     W' = WaveletTransform(M', w)
 4:     d' = NoisyDensityThreshold(D, g, ρ, w, (1 − α)ε)
 5:     return ConnCompLabel(W', d')
 6: end procedure
 7: procedure NoisyDensityThreshold(D, g, ρ, w, ε)
 8:     M = Quantization(D, g)
 9:     W = WaveletTransform(M, w)
10:     L = ConvertToPosSortedArray(W)
11:     k = (1 − ρ)|L|
12:     k' = ExponentialMechanism(L, k, ε)
13:     d' = UniformRandom(L, k')
14:     return d'
15: end procedure


5. PRIVACY AND UTILITY ANALYSIS 

In this section, we present the theoretical analysis of the proposed
techniques PrivQT, PrivTHR and PrivTHR_EM.

5.1 Privacy Analysis

In this part we establish the privacy guarantee of PrivQT, PrivTHR
and PrivTHR_EM.

Theorem 1. PrivQT is ε-differentially private.

Proof. PrivQT introduces independent Laplacian noise Lap(1/ε)
to grid counts, which are computed on disjoint datasets. According
to the parallel composition property of differential privacy, the
privacy cost depends only on the worst guarantee of all
computations over disjoint datasets. Therefore, PrivQT is
ε-differentially private. □
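The argument in the proof can be made concrete with a small sketch of private quantization for two-dimensional points. It is an illustration only: the names `grid_counts` and `private_quantization`, the square g x g grid, and the domain bounds `lo` and `hi` are assumptions introduced here, not the paper's implementation. Each point falls into exactly one cell, which is why the per-cell noise composes in parallel.

```python
import random

def laplace(scale):
    """Laplace(0, scale) sample, drawn as the difference of two i.i.d. exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def grid_counts(points, g, lo=0.0, hi=1.0):
    """Count 2D points per cell of a g x g grid over [lo, hi] x [lo, hi]."""
    counts = [[0] * g for _ in range(g)]
    width = (hi - lo) / g
    for x, y in points:
        # Clamp indices so that boundary points stay inside the grid.
        i = min(g - 1, max(0, int((x - lo) / width)))
        j = min(g - 1, max(0, int((y - lo) / width)))
        counts[i][j] += 1  # every point contributes to exactly one cell
    return counts

def private_quantization(points, g, eps, lo=0.0, hi=1.0):
    """Add independent Lap(1/eps) noise to every cell count."""
    return [[c + laplace(1.0 / eps) for c in row]
            for row in grid_counts(points, g, lo, hi)]
```

Because adding or removing one point changes a single cell count by 1, noise of scale 1/ε per cell suffices for the whole matrix.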

Theorem 2. PrivTHR is ε-differentially private.

Proof. PrivTHR splits the privacy budget into two parts. First,
for private quantization, adding Laplacian noise Lap(1/(αε))
achieves strict αε-differential privacy. The proof is the same as
for PrivQT. Second, PrivTHR introduces Laplacian noise
Lap(1/((1 − α)ε)) to the true count of non-positive values in W,
which achieves (1 − α)ε-differential privacy. Using the
composition property of differential privacy, PrivTHR is
ε-differentially private since ε = αε + (1 − α)ε. □

Theorem 3. PrivTHR_EM is ε-differentially private.

Proof. Similar to PrivTHR, PrivTHR_EM has two steps of
randomization: private quantization and obtaining the noisy
density threshold d'. Private quantization achieves
αε-differential privacy according to the Laplace mechanism and the
parallel composition property. Sampling the noisy density
threshold d' by the Exponential mechanism consumes a budget of
(1 − α)ε, which achieves (1 − α)ε-differential privacy. According
to the composition property of differential privacy, PrivTHR_EM is
ε-differentially private. □

5.2 Utility Analysis 

In this section, we present utility guarantees of our algorithms
(PrivQT, PrivTHR and PrivTHR_EM) with theoretical analysis. In
WaveCluster, the step of significant grid identification determines
the clustering results. In the private results of WaveCluster,
PrivQT, PrivTHR and PrivTHR_EM return a list of noisy significant
grids. To quantify the utility of PrivQT, PrivTHR and PrivTHR_EM,
we consider finding significant grids whose wavelet transformed
values surpass a threshold to be similar to finding the top-k
frequent itemsets whose frequencies surpass a threshold. In
significant grid identification, L is the list of positive wavelet
transformed values from W sorted in ascending order, Z represents
the set of zero values from W, and k indicates the threshold
position in L; all the top-k values in L correspond to significant
grids, where k = (1 − ρ)|L|. One parameter to specify k is the
density threshold ρ, which remains the same either with or without
noise introduction. However, |L|, another parameter to determine
k, will be changed to |L'| under differential privacy, where L' is
the list of positive wavelet transformed values from W' sorted in
ascending order. L is different from L' since noise introduction
might result in a portion of zero values in Z becoming positive and
a small portion of positive values in L becoming non-positive.

5.2.1 Utility Analysis for PrivQT

We first provide the analysis of the difference between k and k' in
PrivQT. In PrivQT, the difference between k' and k depends on two
factors: (1) a set of zero values in Z becoming noisy positive,
$Z'_p = \{W'_z \mid W'_z = W_z + Noise, W_z \in Z, W'_z > 0\}$, where
$W'_z$ is the noisy value of a zero value in Z, and (2) a set of
positive values in L becoming noisy non-positive,
$L'_n = \{W'_l \mid W'_l = W_l + Noise, W_l \in L, W'_l \le 0\}$, where
$W'_l$ is the noisy value of a positive value in L. That is,
$k' = (1 - \rho)(|L| + |Z'_p| - |L'_n|)$.

Analysis of $|Z'_p|$. In PrivQT, since we are adding Lap(1/ε) noise
to each grid count and the Haar transform computes the average
from four adjacent grids, the noise added into a wavelet
transformed value is the sum of four i.i.d. samples from the
Laplace distribution. The sum of n i.i.d. Laplace distributions
with mean 0 is the difference of two i.i.d. Gamma distributions
[26], referred to as distribution T. The density of distribution T
is a polynomial in |x| divided by an exponential in |x|, which is a
symmetric function, and thus the probability for distribution T to
produce positive values is 1/2. Thus, the events of values in Z
receiving positive noise from distribution T conform to the
Binomial distribution with parameters |Z| and 1/2, and its
expected value is |Z|/2.
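The claim that roughly half of the zero values become positive can be checked empirically. The following Python snippet is a small simulation sketch, not part of the original analysis; the function name `count_noisy_positive` is hypothetical, and the noise per transformed value is modeled as the sum of four Lap(1/ε) samples divided by 2, as in the two-dimensional Haar case.

```python
import random

def laplace(scale):
    """Laplace(0, scale) sample, drawn as the difference of two i.i.d. exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def count_noisy_positive(num_zero_values, eps):
    """Count how many zero-valued wavelet coefficients turn positive after noise."""
    count = 0
    for _ in range(num_zero_values):
        # Noise of one Haar coefficient: four Laplace samples, summed and halved.
        noise = sum(laplace(1.0 / eps) for _ in range(4)) / 2.0
        if noise > 0:
            count += 1
    return count
```

Because the noise distribution is symmetric around zero, the returned count concentrates around half of `num_zero_values` regardless of the value of `eps`; the budget only changes the magnitude of the noise, not its sign.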

Analysis of $|L'_n|$. For $L'_n$, noise conforming to the symmetric
distribution T is added to each value. The probability density
function of $|L'_n|$ is
$f_{|L'_n|}(x) = \binom{|L|}{x} \prod_{i=1}^{x} \Pr(T < -W_i) \prod_{i=x+1}^{|L|} \Pr(T > -W_i)$,
$W_i \in L$, and its expected value $E[|L'_n|]$ is
$\sum_{i=1}^{|L|} \Pr(T < -W_i)$. $E[|L'_n|]$ is large when $W_i$
is small and there is limited privacy budget. Consider an extreme
case that might not be suitable for clustering: all the positive
values in L are the minimum value 0.5 because the sum of the four
adjacent grid counts is the minimum value 1, resulting in a high
$E[|L'_n|]$. Clustering, especially the WaveCluster algorithm, is
useful when the dataset has dense areas (clusters) and empty areas
(gaps between clusters). Such an extreme case is not suitable for
clustering since its data distribution is close to the uniform
distribution. Those datasets that are interesting for clustering
always have highly dense cluster centers and cluster borders with
low density. Only those values corresponding to border grids can
become noisy non-positive, and the number of border grids is
relatively small. Therefore, $E[|L'_n|]$ is a small constant.

We refer to the value of $|L'_n|$ as θ in the following analysis.

Analysis of k' − k. In PrivQT, $E[k' - k] = (1 - \rho)(\frac{|Z|}{2} - \theta)$.
There are two extreme cases in which k' − k ≈ 0. For one extreme,
|Z| = |L| and all the positive values in L are the minimum value
0.5. When ε = 1, θ ≈ 0.43|L|, which makes k' − k ≈ 0.

For the other extreme, |Z| = 0 and all the positive values in L are
large, e.g. > 15. When ε = 1, θ ≈ 0 and k' − k ≈ 0. For those
datasets that are interesting in the context of clustering, |Z| is
fairly large compared to the whole space since Z is used to
separate different clusters. What is more, dense areas within
clusters are typically larger than the space of cluster borders
with low density, i.e., θ is far smaller than |Z|/2. In PrivQT,
|Z|/2 dominates the difference between k' and k, which increases
the false positive rate. In PrivTHR and PrivTHR_EM, we use
different strategies to minimize the difference between k' and k.


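Theorem 4. In PrivQT with Haar transform, given 0 < ω < 1, let
$\eta_1 = \frac{|Z|}{2} - \sqrt{\frac{|Z|}{2}\ln\frac{1}{\omega}}$,
$\eta_2 = \frac{|Z|}{2} + \sqrt{\frac{|Z|}{2}\ln\frac{1}{\omega}}$,
and $\gamma = \frac{2}{\epsilon}\ln\frac{4(|L| + |Z|)}{\omega}$;
then with probability at least $(1 - \omega)^2$, (1) all values in
L greater than $W_{k'_{min}} + \gamma$ are output, where
$k'_{min} = k + (1 - \rho)(\eta_1 - \theta)$, and (2) no values in
L less than $W_{k'_{max}} - \gamma$ are output, where
$k'_{max} = k + (1 - \rho)(\eta_2 - \theta)$.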

Proof. In PrivQT, $k' = (1 - \rho)(|L| + |Z'_p| - |L'_n|)$. Since
$|Z'_p|$ follows the Binomial distribution with parameters |Z| and
1/2 and $|L'_n|$ is denoted as a small value θ, k' follows the
Binomial distribution and decides the number of values in L that
become output. Given ω, we can derive the lower bound $k'_{min}$ of
k', and show that values greater than $W_{k'_{min}} + \gamma$ are
output, i.e., subclaim (1). Let
$1 - \omega = \Pr(k' \ge k'_{min}) = \Pr(|Z'_p| \ge \eta_1)$. As
$\Pr(|Z'_p| \ge \eta_1) = 1 - \Pr(|Z'_p| < \eta_1)$ and
$\Pr(|Z'_p| < \eta_1) \le e^{-2(\frac{|Z|}{2} - \eta_1)^2 / |Z|}$,
we have $\eta_1 = \frac{|Z|}{2} - \sqrt{\frac{|Z|}{2}\ln\frac{1}{\omega}}$.
For constant ω, $\eta_1 = O(\frac{|Z|}{2} - \sqrt{|Z|})$ will suffice.

Similar to $k'_{min}$, we can bound the magnitude γ of the noise
added to each value in L ∪ Z based on ω. For Haar wavelet
transform, the noise added to each value in L ∪ Z is the sum of 4
Laplacian random variables divided by 2 (i.e.,
$\frac{1}{2}\sum_{i=1}^{4} Lap(\frac{1}{\epsilon})$). For values in
L ∪ Z, let all 4(|L| + |Z|) Laplacian random variables generate
noise within $[-\frac{\gamma}{2}, \frac{\gamma}{2}]$. The
probability that no Laplacian random variable's value is outside
$[-\frac{\gamma}{2}, \frac{\gamma}{2}]$ is 1 − Pr(A), where A is the
event that at least one Laplacian random variable's value is
outside $[-\frac{\gamma}{2}, \frac{\gamma}{2}]$. By the union
bound, $\Pr(A) \le \sum_{i=1}^{4(|L| + |Z|)} \Pr(B_i)$, where $B_i$
is the event that the i-th Laplacian random variable's noise is
outside $[-\frac{\gamma}{2}, \frac{\gamma}{2}]$ and
$\Pr(B_i) = e^{-\epsilon\gamma/2}$. Thus, with probability at least
$1 - 4(|L| + |Z|)e^{-\epsilon\gamma/2}$, no Laplacian random
variable's value is outside $[-\frac{\gamma}{2}, \frac{\gamma}{2}]$,
and each value in L ∪ Z has its noise amount within [−γ, γ]. Let
$\omega = 4(|L| + |Z|)e^{-\epsilon\gamma/2}$, then
$-\frac{\epsilon\gamma}{2} = \ln\frac{\omega}{4(|L| + |Z|)}$ and we
have $\gamma = \frac{2}{\epsilon}\ln\frac{4(|L| + |Z|)}{\omega}$.
For constant ω, $\gamma = O(\frac{\ln(|L| + |Z|)}{\epsilon})$ will
suffice.

Subclaim (1) can be derived based on (a) with probability at least
1 − ω, $k' \ge k'_{min}$ and (b) with probability at least 1 − ω,
the noise of each value in L being within [−γ, γ]. The detailed
proof is omitted here. Subclaim (1) requires both conditions (a)
and (b) to hold, and thus the probability is at least $(1 - \omega)^2$.

We can derive the upper bound $k'_{max}$ of k' given ω. Let
$1 - \omega = \Pr(k' \le k'_{max}) = \Pr(|Z'_p| \le \eta_2)$. Recall
that $|Z'_p|$ follows the Binomial distribution (|Z|, 1/2), and the
Binomial distribution (|Z|, 1/2) is symmetric with respect to
|Z|/2. Thus, the probability of sampling a value from the range
[0, η2] is the same as sampling a value from the range [η1, |Z|],
and we have
$\eta_2 = |Z| - \eta_1 = \frac{|Z|}{2} + \sqrt{\frac{|Z|}{2}\ln\frac{1}{\omega}}$.
For constant ω, $\eta_2 = O(\frac{|Z|}{2} + \sqrt{|Z|})$ will suffice.

Subclaim (2) can be proved based on (c) with probability at least
1 − ω, $k' \le k'_{max}$ and (b) with probability at least 1 − ω,
the noise of each value in L being within [−γ, γ]. As subclaim (2)
requires both conditions (c) and (b) to hold, the probability is at
least $(1 - \omega)^2$. □

For other wavelet transforms that use circular convolutions, such as
Biorthogonal transform, the derivation for the bounds of k' with η1
and η2 remains the same since $|Z'_p|$ following the Binomial
distribution is independent of the wavelet transform being adopted.
Thus, our framework is extensible to other wavelet transforms, and
the bound of the noise magnitude γ depends on the number of
adjacent grid counts involved in computing a wavelet transformed
value.

5.2.2 Utility Analysis for PrivTHR

Theorem 5. In PrivTHR with Haar transform, given 0 < ω < 1, let
$\eta_1 = \frac{|Z|}{2} - \sqrt{\frac{|Z|}{2}\ln\frac{1}{\omega}}$,
$\eta_2 = \frac{|Z|}{2} + \sqrt{\frac{|Z|}{2}\ln\frac{1}{\omega}}$,
$\gamma = \frac{2}{\epsilon_1}\ln\frac{4(|L| + |Z|)}{\omega}$ and
$\beta = \frac{1}{\epsilon_2}\ln\frac{1}{\omega}$; then with
probability at least $(1 - \omega)^3$, (1) all values in L greater
than $W_{k'_{min}} + \gamma$ are output, where
$k'_{min} = k + (1 - \rho)(\eta_1 - \theta - \frac{|Z|}{2} - \beta)$,
and (2) no values in L less than $W_{k'_{max}} - \gamma$ are
output, where
$k'_{max} = k + (1 - \rho)(\eta_2 - \theta - \frac{|Z|}{2} + \beta)$.

Proof. In PrivTHR, we allocate ε1 for private quantization and ε2
for protecting |Z|, which makes
$k' = (1 - \rho)(|L| + |Z'_p| - \theta - \frac{|Z|}{2} + Lap(\frac{1}{\epsilon_2}))$.
With probability at least $1 - e^{-\beta\epsilon_2}$,
$Lap(\frac{1}{\epsilon_2})$ has the noise amount within β. Let
$1 - \omega = 1 - e^{-\beta\epsilon_2}$, then we get
$\beta = \frac{1}{\epsilon_2}\ln\frac{1}{\omega}$. For constant ω,
$\beta = O(\frac{1}{\epsilon_2})$ will suffice. The proofs of η1,
η2, γ and subclaims (1) and (2) are the same as in Theorem 4, and
$\gamma = O(\frac{\ln(|L| + |Z|)}{\epsilon_1})$ for constant ω. □

Difference between PrivTHR and PrivQT: By Theorem 4, in PrivQT
$k'_{min} = k + (1 - \rho)(\eta_1 - \theta)$ and
$k'_{max} = k + (1 - \rho)(\eta_2 - \theta)$. By Theorem 5, in
PrivTHR $k'_{min} = k + (1 - \rho)(\eta_1 - \theta - \frac{|Z|}{2} - \beta)$
and $k'_{max} = k + (1 - \rho)(\eta_2 - \theta - \frac{|Z|}{2} + \beta)$,
with $\eta_1 = O(\frac{|Z|}{2} - \sqrt{|Z|})$ and
$\eta_2 = O(\frac{|Z|}{2} + \sqrt{|Z|})$. As we can see, by removing
$\frac{|Z|}{2} \pm \beta$ positive values from L', PrivTHR provides
a better utility guarantee than PrivQT since the difference between
k' and k becomes $\pm(\beta + \theta)$, where θ is a small constant
and β is small when sufficient budget ε2 is provided.

5.2.3 Utility Analysis for PrivTHR_EM

Theorem 6. In PrivTHR_EM with Haar transform, given 0 < ω < 1, let
$\eta_1 = |L| - k - 1 + \frac{1}{\epsilon_2}\ln\frac{\ldots}{|W_k|\,\omega}$,
$\eta_2 = k - 2 + \frac{1}{\epsilon_2}\ln\frac{\ldots}{|W_{max} - W_k|\,\omega}$,
and $\gamma = \frac{2}{\epsilon_1}\ln\frac{4(|L| + |Z|)}{\omega}$;
then with probability at least $(1 - \omega)^2$, (1) all values in
L greater than $W_{k'_{min}} + \gamma$ are output, where
$k'_{min} = k - \eta_1$, and (2) no values in L less than
$W_{k'_{max}} - \gamma$ are output, where $k'_{max} = k + \eta_2$.

Proof. In PrivTHR_EM, we allocate ε2 for deriving k' from k by
employing the Exponential mechanism, a general method proposed in
[30]. The probability of selecting a rank i is
$|Pt_i| \cdot \exp(-\frac{\epsilon_2}{2}|i - rank(W_k)|)$, where
$Pt_i$ is the range $(W_{i-1}, W_i]$ decided by the (i − 1)-th and
i-th wavelet transformed values.

Let 1 − ω be the probability of sampling a k' where k − k' < η1;
then
$\eta_1 \le |L| - k - 1 + \frac{1}{\epsilon_2}\ln\frac{\ldots}{|W_k|\,\omega}$.
For constant ω,
$\eta_1 = O(|L| - k + \frac{1}{\epsilon_2}\ln\frac{\ldots}{|W_k|\,\omega})$
will suffice.

Let 1 − ω be the probability of sampling a k' where k' − k < η2;
then
$\eta_2 \le k - 2 + \frac{1}{\epsilon_2}\ln\frac{\ldots}{|W_{max} - W_k|\,\omega}$.
For constant ω,
$\eta_2 = O(k + \frac{1}{\epsilon_2}\ln\frac{\ldots}{|W_{max} - W_k|\,\omega})$
will suffice. The proofs of γ and subclaims (1) and (2) are the
same as in Theorem 4. □


Analysis of PrivTHR and PrivTHR_EM. By Theorem 5 and Theorem 6,
the accuracy for sampling k' in PrivTHR is dominated by
$\beta = O(\frac{1}{\epsilon_2})$, while in PrivTHR_EM the accuracy
is dominated by
$\frac{1}{\epsilon_2}\ln\frac{\ldots}{|W_{max} - W_k|\,\omega}$.
Depending on the data distribution, PrivTHR_EM may present a better
or worse utility guarantee than PrivTHR: the logarithmic term is
positive when its argument is greater than 1, and the accuracy for
sampling k' in PrivTHR_EM then becomes more sensitive to ε2 than in
PrivTHR; the logarithmic term becomes negative when its argument,
which depends on $|W_{max} - W_k|$, is less than 1, and the bounds
of the utility guarantee for PrivTHR_EM then become better than
those for PrivTHR.

Section 7 demonstrates that by reducing the difference between
k' and k, PrivTHR and PrivTHR_EM achieve more accurate results
than PrivQT, which conforms to the above analysis.

6. QUANTITATIVE MEASURES 

To quantitatively assess the utility of differentially private
WaveCluster, we propose two types of measures for measuring the
dissimilarity between true and differentially private WaveCluster
results.

The first type, DSGc, measures the dissimilarity of the
significant grids and the clusters between true and private
results. The second type focuses on observing the usefulness of
differentially private WaveCluster results for further data
analysis. The reason is that a slight difference in the significant
grids or clusters may cause a significant difference when using the
WaveCluster results.

In this paper, we choose a typical application of further data
analysis: building a classifier from the clustering results to
predict unlabeled data. The classifier built from true WaveCluster
results is called the true classifier clf_t, while the classifier
built from differentially private WaveCluster results is called
the private classifier clf_p. To measure the dissimilarity between
clf_t and clf_p, we propose two metrics: OCM and 2CE.

6.1 Dissimilarity based on Significant Grids 
and Clusters 

DSGc considers the dissimilarities of significant grids and
clusters. Assume that there are t clusters of true significant
grids and s clusters of differentially private significant grids.
t might not be equal to s, and the cluster labels in the t true
clusters and the s private clusters are completely arbitrary. To
accommodate these differences, we adopt the Hungarian method, a
combinatorial optimization algorithm, to solve the matching problem
between t true clusters and s private clusters while minimizing the
matching difference.


























Figure 4: Illustration of datasets and their WaveCluster results.
Panels: (a) DS1, (b) DS2, (c) DS3, (d) Gowalla, each shown for one
setting of grid size g and density threshold ρ.


When cluster $C_i$ matches to cluster $C_j$, we define the
distance d between cluster $C_i$ and cluster $C_j$ as
$\max\{|C_i \setminus C_j|, |C_j \setminus C_i|\}$. Consider a
cluster $C_i = \{g_1, g_3, g_5\}$ and a cluster
$C_j = \{g_1, g_5, g_7, g_9\}$. The distance d between clusters
$C_i$ and $C_j$ is $\max\{|\{g_3\}|, |\{g_7, g_9\}|\} = 2$. Given
t true clusters, s private clusters, and t ≥ s, a matching
$M_{t,s}$ of t true clusters and s private clusters is a set of
cluster pairs, where each private cluster is matched with a true
cluster. We then define the cost of a matching (Mcost) as the sum
of all the distances between each cluster pair in the matching
$M_{t,s}$ plus the count of significant grids in the non-matched
clusters:

$Mcost = \sum_{(C_{i_x}, C_{j_y}) \in M_{t,s}} \max\{|C_{i_x} \setminus C_{j_y}|, |C_{j_y} \setminus C_{i_x}|\} + |C_z|$

Here, $i_x$ and $j_y$ indicate the subscripts of clusters in a
matched pair. $|C_z|$ represents the count of significant grids in
the non-matched true clusters. Among all the possible matchings of
clusters, we use the Hungarian method to find the optimal matching
with the minimum Mcost, and compute DSGc as:

$DSGc = \frac{Mcost}{|T|}$

Here T denotes the set of significant grids in the true
WaveCluster results.
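To make the definition concrete, the measure can be sketched in Python for small inputs. Note the swap in technique: instead of the Hungarian method used in the paper, this sketch enumerates all matchings by brute force, which yields the same minimum cost but is only feasible for a handful of clusters. Clusters are modeled as sets of grid identifiers, and the function names `cluster_distance` and `dsgc` are hypothetical.

```python
from itertools import permutations

def cluster_distance(ci, cj):
    """max{|Ci \\ Cj|, |Cj \\ Ci|} for two clusters given as sets of grid ids."""
    return max(len(ci - cj), len(cj - ci))

def dsgc(true_clusters, private_clusters):
    """Minimum matching cost divided by |T|, found by exhaustive search.

    Assumes t >= s, the case described in the text, and at least one true grid.
    """
    t, s = len(true_clusters), len(private_clusters)
    if t < s:
        raise ValueError("this sketch only covers t >= s")
    total_true = len(set().union(*true_clusters))
    best = None
    # chosen[y] is the index of the true cluster matched to private cluster y.
    for chosen in permutations(range(t), s):
        cost = sum(cluster_distance(true_clusters[i], private_clusters[y])
                   for y, i in enumerate(chosen))
        # Add the grids of the true clusters that were left unmatched.
        cost += sum(len(true_clusters[z]) for z in range(t) if z not in chosen)
        best = cost if best is None else min(best, cost)
    return best / total_true
```

For the example in the text, `cluster_distance({1, 3, 5}, {1, 5, 7, 9})` evaluates to 2, matching the distance computed above.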


6.2 Dissimilarity based on Classifier Prediction

OCM and 2CE measure the dissimilarity between clf_t and clf_p. We
name this way of evaluation "clustering-first-then-classification":
given a set of unlabeled data points, we use a portion of the data
points (e.g., 90%) to compute WaveCluster results, where each
cluster is a set of significant grids. Using the significant grids
with cluster labels as training data, we build classifiers clf_t
and clf_p, and use them to predict the classes for the remaining
data points (e.g., 10%).

Dissimilarity of Classifiers based on Optimal Class Matching
(OCM). OCM measures the dissimilarity between the two sets of
classes predicted by clf_t and clf_p for the same test samples. We
use $L_t$ to denote the set of classes predicted by clf_t and $L_p$
to denote the set of classes predicted by clf_p. Since the class
labels in $L_t$ and $L_p$ are completely arbitrary, we exploit the
Hungarian method to find the optimal matching between $L_t$ and
$L_p$.

Assume that a class $L_{t,i}$ predicted by clf_t is matched to a
class $L_{p,j}$ predicted by clf_p, forming a class pair. We
compute the count of common test samples in the class $L_{t,i}$
and the class $L_{p,j}$, and sum the common test samples in each
class pair to compute CT:

$CT = \sum_{(L_{t,i}, L_{p,j})} |L_{t,i} \cap L_{p,j}|, \quad 1 \le i \le c_1, \ 1 \le j \le c_2$

Here $c_1$ is the count of classes in $L_t$ and $c_2$ is the count
of classes in $L_p$, and we assume $c_1 \ge c_2$. Since there are
many possible mappings from the classes in $L_t$ to the classes in
$L_p$, we use the Hungarian method to find the optimal mapping
that maximizes CT. Based on CT and the total count of the test
samples TT, we derive the dissimilarity OCM:

$OCM = 1 - \frac{CT}{TT}$

When the dissimilarity is smaller, the differentially private
WaveCluster results are more similar to the true WaveCluster
results and maintain high utility for classification use.

Dissimilarity of Classifiers based on 2-Combination Enumeration
(2CE). 2CE measures the dissimilarity between clf_t and clf_p
based on the relationships of every pair of test samples, i.e.,
whether two samples are in the same class. Essentially, given a
pair of test samples A and B, we say A and B are classified
consistently if either (1) clf_t(A) = clf_t(B) and
clf_p(A) = clf_p(B) or (2) clf_t(A) ≠ clf_t(B) and
clf_p(A) ≠ clf_p(B). 2CE is the ratio of the count of test sample
pairs that are not classified consistently over the total number
of test sample pairs, which is the set of 2-combinations of the
test samples. 2CE uses pairs of test samples to eliminate the need
of finding the optimal matching between the classes predicted by
clf_t and clf_p.


7. EXPERIMENTS


We evaluate the proposed techniques using three datasets that are
widely used in previous clustering algorithms, and one large-scale
dataset derived from the check-in information of Gowalla, a
geo-social networking website, which was used to evaluate
grid-based clustering algorithms in prior work.

7.1 Experiment Setup 

In our experiments, we compare the performances of the four
techniques, Baseline, PrivQT, PrivTHR, and PrivTHR_EM, on the
four datasets using the two types of measures proposed in Section
6, and provide analysis on the results. We use Haar transform as
the wavelet transform and set the wavelet decomposition level to 1
for the four techniques. Baseline uses the adaptive-grid method
for synthetic data generation. The classification algorithm used
for measuring OCM and 2CE is the C4.5 decision tree algorithm.
We conduct experiments with privacy budgets ranging from 0.1 to
2.0; for each budget and each metric, we apply the techniques on
each dataset 10 times and compute their average performances.
All experiments were conducted on a machine with an Intel 2.67GHz
CPU and 8GB RAM.

Gowalla dataset: https://snap.stanford.edu/data/loc-gowalla.html















Figure 5: Comparing private k' of 4 techniques with true k on DS1, DS2, DS3 and Gowalla with increasing ε. Legend: Baseline-k', PrivQT-k', PrivTHR-k', PrivTHR_EM-k', k.


Datasets. The four clustering datasets contain different data
shapes that are especially interesting for clustering. Figure 4
shows the WaveCluster results on the four datasets under certain
parameter settings of grid size g and density threshold ρ. Any two
adjacent clusters are marked with different colors. The points in
red are identified as noise, and fall into the non-significant
grids.

DS1 is a dataset containing 15 Gaussian clusters with different
degrees of cluster overlapping. It contains 30000 data points.
These 15 clusters are all in convex shapes. The center area of each
cluster has higher density and is resistant to noise. However, the
overlapped area of two adjacent clusters has lower density and is
prone to be affected by noise, which might turn the corresponding
non-significant grids into significant grids and further connect
two separate clusters. DS2 is a dataset with 3 spiral clusters. It
contains 31200 data points. The heads of the spirals are quite
close to one another. Some noisy significant grids are very likely
to bridge the gap between adjacent spirals and merge them into one
cluster. DS3 is a dataset with 5 clusters of various shapes,
including concave shapes. It contains 31520 data points. There are
two clusters that each contain two sub-components and a narrow
line-shaped area that bridges those two sub-components. The narrow
bridging area has low density and might be turned into
non-significant grids, causing a cluster to split into two
clusters. Gowalla is the check-in dataset resembling the world
map, which records time and location information of users'
check-ins. We use only the location information for evaluation.
There are about 6.4M records in total. The large size of the
dataset makes it infeasible to run experiments with C4.5 and
Baseline due to memory constraints. Thus, similar to prior work,
we sampled 1M records from the dataset for evaluation.

We next present the results on comparing k' and k, and then 
present the results of the two types of measures. 

7.2 Comparing Private k' With True k 

We first measure the differences between the true k and the
private k' on each dataset with ε ranging from 0.1 to 2.0, and the
results are shown in Figure 5. The results show that for all
datasets, when ε > 0.5, the relative errors of k', i.e.,
$\frac{|k' - k|}{k}$, in PrivTHR and PrivTHR_EM are less than 4.7%
on average, while the relative errors of k' in Baseline and PrivQT
range from 32.2% to 150.5%. For example, in DS2, the true k is
144. When ε is 1, the average private k' is 141.0 (2.1%) for
PrivTHR and 142.8 (0.8%) for PrivTHR_EM, while Baseline and PrivQT
obtain 284.0 (97.2%) and 249.2 (73.1%) for the average k',
respectively. Note that |Z| is 241 in DS2, and the difference
between the average k' and k is 105.2 for PrivQT, which is quite
close to the theoretical bound $(1 - \rho)\frac{|Z|}{2} = 108.45$
derived from our utility analysis in Section 5.2.1. When ε is 0.1,
the k' in PrivTHR_EM deviates from k more significantly than the
k' in PrivTHR, indicating that PrivTHR_EM is more sensitive to ε
than PrivTHR, as discussed in Section 5.2.3. For example, in DS2,
the average k' in PrivTHR_EM is 82.8 (42.5%) while the average k'
in PrivTHR is 131.2 (8.9%).
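The arithmetic behind the DS2 figures quoted above can be re-derived with a few lines of Python. This is only a consistency check on the numbers stated in the text; the value `rho = 0.1` is an assumption inferred so that (1 − ρ)|Z|/2 matches the reported 108.45, and the helper name `relative_error` is hypothetical.

```python
def relative_error(k_private, k_true):
    """Relative error |k' - k| / k."""
    return abs(k_private - k_true) / k_true

k_true = 144  # true k for DS2, as reported in the text
average_k = {"PrivTHR": 141.0, "PrivTHR_EM": 142.8,
             "Baseline": 284.0, "PrivQT": 249.2}  # average k' at eps = 1

# Relative errors in percent, rounded to one decimal place.
errors = {name: round(100 * relative_error(v, k_true), 1)
          for name, v in average_k.items()}

rho = 0.1     # assumption: density threshold inferred from the reported bound
z_size = 241  # |Z| for DS2, as reported in the text
expected_shift = (1 - rho) * z_size / 2  # (1 - rho) * |Z| / 2
observed_shift = average_k["PrivQT"] - k_true
```

Under these assumptions the computed percentages agree with the ones quoted in the paragraph, and the observed PrivQT shift of about 105.2 sits just below the expected shift of about 108.45.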

7.3 Results of DSGc

Figure 6 shows the results of DSGc for the four techniques
when the privacy budget ranges from 0.1 to 2.0. The X-axis shows
the privacy budgets, and the Y-axis denotes the values of DSGc. As
shown in the results, both PrivTHR and PrivTHR_EM achieve smaller
DSGc values than Baseline and PrivQT on all four datasets for all
budgets. The reason is that though the noisy significant grids
generated by Baseline and PrivQT may be similar to the true
significant grids, these noisy significant grids result in very
different shapes of clusters and thus result in a large value of
DSGc, while PrivTHR and PrivTHR_EM preserve more accurate cluster
shapes. For example, in DS3, the narrow line-shaped areas and the
gap between two adjacent clusters are sensitive to noise. If some
noisy significant grids appear in these areas, two clusters may be
merged into one; if some significant grids disappear due to noise,
one cluster might be split into two clusters. Such changes cause
DSGc to increase significantly.

Unlike the other techniques, PrivQT benefits little from the
increased privacy budgets. For PrivQT, the difference between k'
and k is dominated by |Z|/2. Increasing privacy budgets can only
reduce the noise magnitude and cannot smooth such a difference.

Comparison to F-Measure Results. Clustering analysis usually
uses F-measure as a representative external validation to measure
the similarity between the ground truth (known class labels) and
the clustering results. In our experiments, we consider the true
WaveCluster results as the ground truth, and the results of
F-measure are shown in Figure 7. The results show that PrivQT and
Baseline achieve high F-measure scores (more than 0.8) for almost
all budgets in DS1, even though the private results produced by
PrivQT and Baseline are quite different from the true results. For
example, when ε = 0.1, the private results of PrivQT and Baseline
have more than 30 clusters while the true results have only 15
clusters. On the contrary, Figure 6(a) shows that DSGc is able to
clearly differentiate the performances of the four techniques.
The reason is that unlike DSGc, which allows only one-to-one
mapping between true and private clusters, F-measure allows
one-to-many or many-to-one mapping between true and private
clusters. If the number of true clusters is larger than that of
private clusters, F-measure allows many-to-one mapping, and vice
versa. Thus, DSGc presents a stricter evaluation than F-measure in
computing similarity/dissimilarity.

7.4 Results of OCM and 2CE 

Figure 6: Comparing DSGc of 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. 

Figure 7: Comparing F-Measure of 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. 

Figure 8: Comparing OCM and 2CE of 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. 

Results of OCM. Figure 8 shows the results of OCM for the 
four techniques. The X-axis denotes the privacy budgets while the 
Y-axis denotes the values of OCM. As shown in the results, PrivTHR 
and PrivTHREM achieve smaller OCM values than Baseline and 
PrivQT for all datasets when ε ranges from 0.5 to 2.0. When ε is 
greater than 0.5, the OCM values of PrivTHR and PrivTHREM are 
less than 0.15 on DS1, DS3, and Gowalla, indicating that the private 
classifier clfp produces predictions highly similar to those of the 
true classifier clft. On DS2, which contains 3 spirals, PrivTHREM 
still maintains a very low OCM value (< 0.1) when ε is greater than 
0.5, while PrivTHR has a slightly worse OCM value (ranging from 
0.1 to 0.2). Such results show that PrivTHREM is more resilient to 
noise for concave-shaped data than PrivTHR. 

Results of 2CE. Figure 8 shows the results of 2CE for the 
four techniques. The X-axis denotes the privacy budgets while the 
Y-axis denotes the values of 2CE. As shown in the results, PrivTHR 
and PrivTHREM achieve smaller 2CE values than Baseline and PrivQT 
for all datasets when ε ranges from 0.5 to 2.0. 

In general, all four techniques exhibit trends in 2CE similar to 
their trends in OCM. On DS1, all four techniques have very low 
2CE values (< 0.1) though their corresponding OCM values are 
much higher (ranging from 0.05 to 0.5). The reason is that 2CE 
captures the relationships between data points while OCM focuses 
on the mappings of classes. If k test samples out of N total 
samples have different prediction results in the true and private 
results, 2CE expresses the differences as C(k, 2) + k(N - k) 
over the total number of combinations of test samples, C(N, 2), 
while OCM expresses the differences as k over N. On DS1, the k test 
samples are predicted to be in the same cluster in the private 
results, and the C(k, 2) term becomes close to 0. In this case, only 
k(N - k) matters in the computation of 2CE. Given that C(N, 2) is 
much larger than N and k(N - k) when N of DS1 is about 30,000, 2CE 
has a smaller value than OCM for measuring the differences, and thus 
is less sensitive to the noise on DS1. 

Budget Allocation for PrivTHR. Based on the utility analysis in 
Section 5.2.2, ε1 for private quantization affects the accuracy of γ, 
and ε2 for obtaining |Z|' affects the accuracy of β. As the constant 
factor of γ derived there is larger than the constant factor of β, 
more budget should be allocated to ε1 to achieve better utility. 
We evaluate the values of DSGc of PrivTHR on DS1 under different 
budget allocation strategies, ranging from 1% for ε1 to 99% for ε1. 
Based on the results, the budget allocation strategy with 90% for ε1 
and 10% for ε2 performs the best. The other measures on DS1 show 
similar results, as do both types of measures on the other datasets. 
Detailed results are omitted. 
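
The budget split can be sketched as follows. The sketch assumes sequential composition (the two budgets sum to the total budget) and, purely for illustration, that both steps add Laplace noise with sensitivity 1. The helper names `split_budget` and `laplace_scale` are hypothetical, and the sensitivity values are not taken from the analysis in Section 5.2.2.

```python
def split_budget(epsilon, fraction_quantization=0.9):
    """Split a total budget into (eps1, eps2) so that eps1 + eps2 == epsilon.

    eps1 goes to private quantization, eps2 to obtaining the noisy |Z|'.
    """
    if not 0.0 < fraction_quantization < 1.0:
        raise ValueError("fraction_quantization must lie strictly in (0, 1)")
    eps1 = epsilon * fraction_quantization
    eps2 = epsilon - eps1
    return eps1, eps2

def laplace_scale(sensitivity, eps):
    """Scale of Laplace noise for a query with the given sensitivity."""
    return sensitivity / eps

total_epsilon = 1.0
eps1, eps2 = split_budget(total_epsilon, fraction_quantization=0.9)

# Assumed sensitivity of 1 for both steps (illustrative only).
scale_quantization = laplace_scale(1.0, eps1)
scale_count = laplace_scale(1.0, eps2)

print(eps1, eps2)
print(scale_quantization, scale_count)
```

Under these assumptions, giving 90% of the budget to quantization keeps its noise scale close to that of the undivided budget, while the noise scale for the |Z|' step is about ten times larger. This matches the observation that the quantization step is the one whose accuracy matters more.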

8. CONCLUSION 

In this paper, we have addressed the problem of cluster analysis 
with differential privacy. We take a well-known, effective, and 
efficient clustering algorithm called WaveCluster, and propose several 
ways to introduce randomness in the computation of WaveCluster. 
We also devise several new quantitative measures for examining 
the dissimilarity between the non-private and differentially private 
results and the usefulness of differentially private results in 
classification. In the future, we will investigate other categories 
of clustering algorithms, such as hierarchical clustering, under 
differential privacy. Another important problem is to explore the 
applicability of differentially private clustering in cases where the 
users do not have good knowledge about the dataset, and the parameters 
of the algorithms should be inferred in a differentially private way. 